Group members: Benjamin Vidmar, Lynn Vandenbusch, Luke Lanskey, Gabrielle Beaulieu
Abstract:
In this project, we examine the relationship between socioeconomic characteristics and political ideology across European regions using European Union election data and socioeconomic indicators.
Using the linear methods of Lasso, Ridge, and Elastic Net and the non-linear learning methods of k-Nearest Neighbors, Random Forest, and Gradient Boosting, we assess whether socioeconomic data can predict political ideology. Overall, the models perform reasonably well, with R-squared ranging between 0.3 and 0.4. The linear models identify poverty rate as the strongest left-leaning predictor, while labor productivity and the gender employment gap are associated with right-leaning tendencies. For the non-linear models, random forest performs best and finds productivity to be the strongest predictor. Our findings suggest that socioeconomic factors are a reasonable predictor of political ideology and highlight the role of machine learning algorithms in capturing the non-linearities in voting behavior.
Using European election data, we analyze the impact of socioeconomic characteristics on political ideology. This is an important topic in the political sciences. In particular, for polling firms, understanding how different socioeconomic characteristics influence ideology is a key input for predicting election outcomes. Likewise, for policymakers, analyzing how demographic changes influence ideological trends can shape policies and messaging (Kuha, 2022).
There has been a range of studies on the impact of socioeconomic variables on voting in individual countries. For example, the Pew Research Center (2024) finds strong divides in terms of income, education, gender, and race between Democrats and Republicans in the United States. In the UK, Ansell and Gingrich (2022) find significant relationships between income, housing, education, and age and the probability of voting Labour or Conservative in General Elections, but these relationships have changed over time. Notably, income has become a weaker predictor over time. Here we extend the literature by studying the impact of demography and socioeconomic status on ideology in the European Union.
We utilize three main data sources for this project. First, we draw on data from the European Parliament Nomenclature of Territorial Units for Statistics (NUTS) level election database (Schraff et al. 2022), which provides a regional breakdown of general and European election data across the European NUTS2 regions, a classification that standardizes regional divisions within the European Union. Second, we use data from Caravaca et al. (2022), which places European parties on a left-right political ideology spectrum. The left-right political spectrum is developed using a combination of the ParlGov database and social media posts to place parties on a 1-10 scale, with 1 representing the left-most ideology and 10 the right-most (see Caravaca et al. 2022 for further details). We then develop a weighted average ideology for each NUTS2 region by combining the two datasets. This is our 'response variable.' Third, we scrape a range of socioeconomic data, encompassing income, demography, health, inequality, and migration, from Eurostat by NUTS2 region to use as predictor variables (Eurostat 2024). We use feature engineering, including NA imputation and transformations, to enhance the quality of our predictions.
Various supervised learning techniques are employed to fit our model, including Lasso, Ridge, Elastic Net, k-Nearest Neighbors, Gradient Boosting, and Random Forest. Through these different techniques, we aim to gain insights into whether socioeconomic data is a good predictor of political ideology and which predictors are the most important. By comparing the linear and non-linear models, we assess whether the relationship is linear or whether non-linearities also exist. Throughout, we principally rely on cross-validation for training our models, testing, and preventing overfitting.
Voting Data: Political Leaning
To determine the political leaning of EU regions based on voting outcomes, a weighted average of the left-right scores for each region is calculated using the European election data joined with the party ideology data. Missing values in the dataset, primarily due to small parties with incomplete data, are addressed through row-wise deletion. The weighted average of the left-right scores is calculated by using the number of votes each party received as weights. The final dataset presents an indication of whether a region leans left or right politically.
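The vote-weighted average described above can be sketched in a few lines of pandas. This is a minimal illustration on made-up numbers, not the project's code: the column names (`nuts2`, `party`, `votes`, `left_right`) and all values are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical toy data: votes per party per region, and party scores on the 1-10 scale.
votes = pd.DataFrame({
    "nuts2": ["AT11", "AT11", "AT12", "AT12", "AT12"],
    "party": ["A", "B", "A", "B", "C"],
    "votes": [100, 300, 200, 200, 50],
})
ideology = pd.DataFrame({
    "party": ["A", "B"],        # party C has no ideology score
    "left_right": [3.0, 7.0],
})

# Join ideology scores onto the vote counts; unmatched parties get NaN.
merged = votes.merge(ideology, on="party", how="left")

# Row-wise deletion of parties with a missing ideology score.
merged = merged.dropna(subset=["left_right"])

# Vote-weighted mean per region: sum(votes * score) / sum(votes).
merged["weighted_score"] = merged["votes"] * merged["left_right"]
totals = merged.groupby("nuts2")[["weighted_score", "votes"]].sum()
weighted_avg_left_right = totals["weighted_score"] / totals["votes"]

print(weighted_avg_left_right)
```

Note that dropping unscored parties before weighting means their votes do not enter the denominator either, so each regional score is an average over scored parties only.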
Socioeconomic Data: Overview of the Data Cleaning Process for the Eurostat Socioeconomic Data
Socioeconomic variables from the Eurostat API are scraped, cleaned, and merged into a single database to serve as explanatory factors in the project's models. All socioeconomic variables provided by Eurostat at the NUTS2 level are evaluated, with only those potentially relevant to the analysis imported.
Data Filtering
Rows containing EU-wide totals are excluded. The United Kingdom is removed to exclude potential noise caused by Brexit. Additionally, turnout in the UK election was only 36.9% (European Parliament, 2022), so the result is not very representative.
Missing Data (NAs): Analysis
The individual variables are consolidated into a single database using the outer merge functionality on the geographic keys, aligning all data by region. The outer merge introduces missing values due to the inclusion of regions with partial data. The dataset is filtered to retain only the 235 regions in the voting dataset.
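A minimal sketch of this consolidation step, assuming each indicator arrives as its own table keyed by a `geo` column (a hypothetical name), could look as follows. The values and region list are invented for illustration.

```python
from functools import reduce

import pandas as pd

# Hypothetical per-indicator tables, each keyed by a NUTS2 code in "geo".
income = pd.DataFrame({"geo": ["BE10", "DE11"], "gross_income_per_capita": [30000.0, 35000.0]})
poverty = pd.DataFrame({"geo": ["BE10", "FR10"], "poverty_rate": [20.1, 15.3]})
life = pd.DataFrame({"geo": ["DE11", "FR10", "UKI3"], "life_expectancy": [81.0, 83.0, 82.0]})

# Chain outer merges so every region seen in any table is kept;
# regions absent from a table get NaN for that table's column.
merged = reduce(
    lambda left, right: left.merge(right, on="geo", how="outer"),
    [income, poverty, life],
)

# Keep only the regions that also appear in the voting dataset (hypothetical list).
voting_regions = ["BE10", "DE11", "FR10"]
merged = merged[merged["geo"].isin(voting_regions)].reset_index(drop=True)

# Count the missing values the outer merge introduced, per column.
na_counts = merged.isna().sum()
print(na_counts)
```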
As shown in the bar chart, the majority of missing values in the outer-merged dataset are concentrated in percentage_individuals_regularly_using_internet, railway_km_per_square_km, available_hospital_beds_per_100k and poverty_rate.
The missing data is classified as Missing at Random (MAR), meaning the likelihood of missingness depends on observed factors, such as the country to which the data belongs. For example, as seen in the histogram, Germany exhibits the highest count of missing values for railway_km_per_square_km, percentage_individuals_regularly_using_internet, and available_hospital_beds_per_100k, while data for all other variables is complete.
The heatmap demonstrates this pattern can be generalized to most countries and variables. Missingness tends to be either complete (100%) or absent (0%) for most country-variable combinations implying a correlation between country and data availability.
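The table behind such a heatmap can be sketched as the share of missing values per country and variable. The example below uses invented rows and assumes the country can be read from the first two characters of the NUTS2 code; the `geo` column name is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical regional data with NUTS2 codes.
df = pd.DataFrame({
    "geo": ["DE11", "DE12", "FR10", "FR21", "IT11"],
    "railway_km_per_square_km": [np.nan, np.nan, 0.05, 0.04, 0.06],
    "poverty_rate": [15.0, 16.0, np.nan, np.nan, 20.0],
})

# NUTS codes begin with a two-letter country code.
df["country"] = df["geo"].str[:2]

# Percentage of missing values for each country-variable combination.
value_cols = ["railway_km_per_square_km", "poverty_rate"]
missing_share = df[value_cols].isna().groupby(df["country"]).mean() * 100

# True when every combination is either fully missing or fully observed.
all_or_nothing = bool(((missing_share == 0) | (missing_share == 100)).all().all())
print(missing_share)
```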
MAR implies valid inferences are still possible despite missing data. We use various imputation methods to handle the NA values to ensure there is enough data to fit the model.
An analysis of the correlation between missing values and certain regional characteristics, such as disposable_income_per_capita (0.097) and population_density (0.057), reveals minimal relationships. This analysis is conducted because regions with lower income levels or more rural areas might have lower reporting rates; however, the findings suggest this is not the case, likely because the data is collected at the country level. This indicates missingness is driven by country-specific reporting practices or variable-specific data collection priorities. It is also important to note that the missingness is not correlated with what the models are predicting: left-leaning and right-leaning regions are equally likely to have missing data.
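One way to run such a check is to correlate a per-region count of missing values with the characteristics of interest. The sketch below uses synthetic data in which missingness is generated independently of income and density. Measuring missingness as a row-wise count is an assumption, since the exact measure used is not stated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic regions; values are illustrative only.
df = pd.DataFrame({
    "disposable_income_per_capita": rng.normal(18000, 4000, n),
    "population_density": rng.lognormal(5, 1, n),
    "poverty_rate": rng.normal(20, 5, n),
})

# Remove roughly 20% of poverty_rate values, independently of the other columns.
df.loc[rng.random(n) < 0.2, "poverty_rate"] = np.nan

# Number of missing values per region (row).
n_missing = df.isna().sum(axis=1)

# Pearson correlation of each characteristic with the missingness count.
corr = df[["disposable_income_per_capita", "population_density"]].corrwith(n_missing)
print(corr)
```

Correlations near zero, as in this construction, are what one would expect when missingness is unrelated to the regional characteristic.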
Missing Data (NAs): Cleaning
For all columns with less than 1% missing values, the missing values are imputed using the country-level median or mean, depending on the presence of outliers. Country-level means are used to replace the missing entries for gender_employment_gap_proportion, gross_income_per_capita, proportion_with_higher_educ, purchasing_power_standard_per_inhabitant, and raw_milk_on_farms. Due to the presence of outliers in disposable_income, employment_rate, life_expectancy, and real_labour_productivity_per_person_employed, missing values in these variables are replaced with the country-level medians so the estimates are not skewed, as shown in the normalized boxplot.
Columns with more than 1% missing values are handled differently. Poverty rate is hypothesized to be a significant factor in explaining voting patterns. About 85% of missing poverty_rate values are from France and Belgium. To address this, regional poverty data for 2019 is retrieved from the Belgian and French statistics offices. Belgium’s data is directly merged into the dataframe as it is properly aligned with the NUTS2 level. The French poverty data, however, does not include social exclusion, which is part of Eurostat's definition of poverty. The difference between France’s total poverty rate and its poverty rate including social exclusion is calculated and proportionally applied to the French regions to account for this difference.
For the remaining missing poverty_rate values (15.6% of total), the mean values of the countries are used. This decision is based on the small proportion of missing data and the absence of significant outliers in the variable.
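Country-level imputation of this kind can be sketched with a grouped transform. The helper name `impute_by_country`, the `country` column, and the values are assumptions for illustration. The choice of statistic per column follows the text: the median for an outlier-prone variable, the mean for poverty_rate.

```python
import numpy as np
import pandas as pd

# Hypothetical regional data with gaps.
df = pd.DataFrame({
    "country": ["DE", "DE", "DE", "FR", "FR", "FR"],
    "employment_rate": [70.0, np.nan, 90.0, 60.0, 62.0, np.nan],
    "poverty_rate": [15.0, 17.0, np.nan, 20.0, np.nan, 24.0],
})

def impute_by_country(frame, column, statistic):
    """Fill NaNs in `column` with the given statistic of the same country."""
    # transform() broadcasts each country's statistic back to its rows.
    fill_values = frame.groupby("country")[column].transform(statistic)
    return frame[column].fillna(fill_values)

# Median for the outlier-prone variable, mean otherwise.
df["employment_rate"] = impute_by_country(df, "employment_rate", "median")
df["poverty_rate"] = impute_by_country(df, "poverty_rate", "mean")
print(df)
```

A country with no observed values for a column would stay missing under this approach, which is consistent with dropping variables or countries whose data is entirely absent.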
Conversely, the variables percentage_individuals_regularly_using_internet, railway_km_per_square_km, and available_hospital_beds_per_100k are dropped, as their missing values are spread across many countries and they are less likely to be important explanatory variables. Similarly, the entire country of Malta is excluded from the dataset as it has 100% missing values in six columns.
Merge of Voting and Socioeconomic Data
The Eurostat socioeconomic data and voting data are merged on the common geography key. The final dataset, which is used for the analysis, comprises 16 independent variables and the dependent variable, weighted_avg_left_right, representing political orientation.
Response Variable - Weighted Average Ideology
The map shown below visualizes the left-right political index across Europe, with regions color-coded to represent varying political orientations on a scale from left (blue) to right (red). Notably, there are clear regional clusterings of political leanings, where some countries display regional variations while others appear to be more politically uniform.
The dependent variable, weighted_avg_left_right, is centered around 5.5, with minimal skew and no extreme outliers, making it suitable for statistical modeling without any feature engineering (see Table).
Most regions fall between 5 and 6 on the political index, reflecting moderate political preferences. However, there are variations, with certain regions leaning distinctly right while others lean more to the left. To gain deeper insights, we analyze a subset of the extreme regions to explore their unique characteristics. Differences in the extremes align with economic factors. The Figure below shows that left-leaning regions often exhibit higher purchasing power, while right-leaning regions tend to have lower economic prosperity.
Predictive Variables
Among the independent variables, purchasing_power_standard_per_inhabitant shows the largest range and a high standard deviation, reflecting significant regional economic disparities. The variable's right-skewed distribution suggests a small number of regions with disproportionately high purchasing power. Similarly, disposable_income_avg has a substantial range and high standard deviation, highlighting its variability across regions, likely linked to economic or demographic factors. Conversely, variables like proportion_with_higher_educ and gender_employment_gap_proportion show low variability and near-symmetrical distributions.
Note that the variable disposable_income_avg is measured at the household level, which is why it has larger minimum, maximum, and mean values than gross_income_per_capita.
Interrelations between Variables
The correlation plot reveals how the independent variables relate to one another. Strong positive correlations include those between gross_income_per_capita and disposable_income_avg (0.79), indicating consistent income dynamics across regions, and between employment_rate and proportion_with_higher_educ (0.52), suggesting that regions with higher employment rates are likely to be associated with greater levels of educational attainment. Negative correlations, such as between poverty_rate and life_expectancy (-0.47), highlight the inverse relationship between poverty and socioeconomic development.
The bar plot examines correlations between independent variables and weighted_avg_left_right. Positive correlations include labor productivity, gender employment gap, and fertility rate, while negative correlations include gross income, migration, and life expectancy. This highlights how certain socioeconomic variables can influence political orientations.
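Both plots rest on Pearson correlations, which can be computed as sketched below. The data is synthetic and constructed so that the two income variables move together and poverty_rate relates negatively to the response. None of the numbers correspond to the project's results.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 100

# Synthetic predictors; disposable income is built to track gross income.
income = rng.normal(0, 1, n)
df = pd.DataFrame({
    "gross_income_per_capita": income,
    "disposable_income_avg": income + rng.normal(0, 0.5, n),
    "poverty_rate": rng.normal(0, 1, n),
})
# Synthetic response with a negative dependence on poverty_rate.
df["weighted_avg_left_right"] = 5.5 - 0.3 * df["poverty_rate"] + rng.normal(0, 0.5, n)

predictors = df.drop(columns="weighted_avg_left_right")

# Pairwise correlations among predictors (input to a correlation plot).
corr_matrix = predictors.corr()

# Correlation of each predictor with the response (input to a bar plot).
corr_with_response = predictors.corrwith(df["weighted_avg_left_right"]).sort_values()

print(corr_matrix.round(2))
print(corr_with_response.round(2))
```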
Transformations
The descriptive statistics reveal skewness and outliers in pop_density_per_square_km, disposable_income_avg, fertility_rate, and purchasing_power_standard_per_inhabitant. To mitigate associated issues, log transformations are applied to these predictive variables.
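The effect of a log transformation on a right-skewed variable can be sketched as follows. The values are invented, and the use of the natural log (rather than, say, log1p) is an assumption. The output column names follow the log_pop_density and log_disposable_income identifiers used later in the report.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values (one very dense, very rich region).
df = pd.DataFrame({
    "pop_density_per_square_km": [25.0, 80.0, 150.0, 400.0, 7000.0],
    "disposable_income_avg": [9000.0, 15000.0, 18000.0, 21000.0, 60000.0],
})

# Map each raw column to the name of its log-transformed counterpart.
log_names = {
    "pop_density_per_square_km": "log_pop_density",
    "disposable_income_avg": "log_disposable_income",
}

skew_before = df[list(log_names)].skew()

# Natural log; valid here because all values are strictly positive.
for raw, logged in log_names.items():
    df[logged] = np.log(df[raw])

skew_after = df[list(log_names.values())].skew()

print(skew_before.round(2))
print(skew_after.round(2))
```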
To identify any non-linear relationships, we examine scatterplots of each independent variable against the dependent variable. No obvious examples are identified, although early scatterplots initially suggested a potential non-linear relationship for poverty_rate. Upon further analysis, however, the relationship between poverty_rate and weighted_avg_left_right is determined to be more a function of different countries clustering together (as can be seen in the scatterplot below), and thus a non-linear relationship is deemed unlikely. The clustering effect is confirmed when scatterplots are color-coded by the country of each NUTS2 region, highlighting distinct groupings that lack any apparent non-linear relationship.
Nonetheless, to confirm the absence of non-linear relationships, early LASSO models were run with polynomial transformations of poverty_rate from the second to the fourth power. Confirming the hypothesis, these early models consistently dropped every higher-order term, demonstrating that non-linear transformations did not yield any additional predictive value. Therefore, the primary LASSO, Ridge, and Elastic Net models are run without any polynomial transformations.
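A check of this kind can be sketched with scikit-learn's `LassoCV`. The data below is synthetic and linear in poverty_rate by construction, so it only illustrates the mechanics. Standardizing poverty_rate before raising it to higher powers is an assumption made here to reduce collinearity between the terms, not a documented step of the original analysis.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300

# Synthetic data: the response depends only linearly on poverty_rate.
poverty_rate = rng.normal(20, 5, n)
y = 5.5 - 0.05 * (poverty_rate - 20) + rng.normal(0, 0.3, n)

# Standardize first, then build the 1st to 4th powers as candidate features.
z = (poverty_rate - poverty_rate.mean()) / poverty_rate.std()
X = np.column_stack([z ** power for power in range(1, 5)])
X_std = StandardScaler().fit_transform(X)

# 10-fold cross-validated Lasso chooses the penalty strength.
model = LassoCV(cv=10, random_state=0).fit(X_std, y)

for power, coef in zip(range(1, 5), model.coef_):
    print(f"poverty_rate^{power}: {coef:.4f}")
```

Higher-order coefficients that are shrunk to zero, or close to it, indicate that the polynomial terms add little beyond the linear term.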
Potential multicollinearity is identified between gross_income_per_capita and purchasing_power_standard_per_inhabitant (as mentioned above). Thus, gross_income_per_capita is dropped in the final models. We also drop raw_milk_on_farms to reduce the number of predictive variables. The final linear models are run with 16 predictors.
A note on country effects:
While country-level clustering is evident in the scatterplots (see example below), the models intentionally do not include country dummy variables, treating each NUTS2 region as an independent unit. This decision is based on our goal of understanding regional characteristics rather than attributing variance to national effects. Of course, country effects are highly important in voting patterns and socioeconomic dynamics, as can be seen in the below scatterplot. However, it is not feasible to add dummy variables for every country due to concerns regarding overfitting.
In this section, the performance of three linear models is compared and contrasted. In each analysis, the data is divided into training and test sets using a 75-25 split. Given the limitations stemming from the heterogeneous number of NUTS2 regions per country (with some having only 1-2 regions/observations), the data is not stratified by country when forming the training and test sets. However, the same fixed training and test sets are used for each linear model.
Prior to modeling, all predictors are standardized, ensuring a consistent scale across independent variables. Consequently, the coefficients presented in the Linear Analysis section reflect the effect of a one standard deviation increase in the independent variable on the dependent variable.
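The split and standardization can be sketched as below on synthetic data. Fixing `random_state` is one way to obtain the same training and test sets for every model. Fitting the scaler on the training set only is an assumption made here to avoid leaking test information; the report does not state which rows were used to compute the scaling.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the predictors and the response.
X = rng.normal(loc=50, scale=10, size=(200, 3))
y = rng.normal(loc=5.5, scale=1.0, size=200)

# 75-25 split; the fixed seed makes the split reproducible across models.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Learn means and standard deviations on the training data only,
# then apply the same scaling to both sets.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.mean(axis=0).round(6), X_train_std.std(axis=0).round(6))
```

With predictors on a common scale, each coefficient can be read as the change in the response per one standard deviation of the predictor.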
The Elastic Net and LASSO models perform very similarly, while the Ridge model performs only slightly worse.
The three models perform as follows:
LASSO's model selection provides an interesting starting point for our linear analysis. The model explains 28% of the variance in the dependent variable, performing better than the Ridge model and essentially the same as the Elastic Net.
At an optimal lambda of 0.04 (-3.15 on the natural-log scale used in the chart below), the model retains 9 variables and drops 7. Poverty rate is by far the most significant: a one standard deviation increase in poverty_rate is associated with a 0.27 decrease in the dependent variable, weighted_avg_left_right (i.e., poverty rate is correlated with voting left). The most predictive variables are the following:
Meanwhile, the following 7 predictors are dropped from the model: pop_density_per_square_km, log_pop_density, disposable_income_avg, log_disposable_income, count_of_all_crime_per_100k, proportion_with_higher_educ, life_expectancy.
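How a cross-validated Lasso retains some predictors and drops others can be sketched with scikit-learn's `LassoCV`, where the penalty the text calls lambda is named `alpha`. The data is synthetic, with 16 standardized predictors of which only three carry signal, and the tooling is an assumption, since the report does not name the software used.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 200, 16

# Synthetic stand-in for 16 standardized predictors; only three matter.
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [-0.27, 0.15, 0.10]
y = 5.5 + X @ beta + rng.normal(0, 0.5, n)

# 10-fold cross-validation over an automatically generated penalty grid.
lasso = LassoCV(cv=10, random_state=0).fit(X, y)

retained = np.flatnonzero(lasso.coef_)        # indices with non-zero coefficients
dropped = np.flatnonzero(lasso.coef_ == 0)    # indices shrunk exactly to zero
log_lambda = np.log(lasso.alpha_)             # penalty on the natural-log scale

print(f"lambda={lasso.alpha_:.4f}, log(lambda)={log_lambda:.2f}")
print(f"retained {len(retained)} predictors, dropped {len(dropped)}")
```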
Of the three linear models, the Ridge Model performed worst, explaining 26.76% of the variance in the dependent variable. The model is specified to split the data into 10 folds, and at an optimal lambda of 0.59 (standardized by natural log to -0.52) the most predictive variables with the highest coefficients are the following:
The Elastic Net Model performs similarly to the LASSO model, explaining 28% of the variance in the dependent variable. The model divides the data into 10 folds, using an alpha of 0.5 to weight the LASSO and Ridge penalties equally, and finds an optimal lambda of 0.07 (standardized by natural log to -2.55). The model retains 11 variables and drops 5, retaining the count_of_all_crime_per_100k and log_disposable_income predictors that were dropped by LASSO.
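The contrast between Ridge, which shrinks but keeps every coefficient, and Elastic Net, which can also set some to zero, can be sketched as follows. In scikit-learn the mixing weight the text calls alpha is `l1_ratio` and the penalty strength is `alpha`. Scikit-learn also scales the Ridge penalty differently from some other packages, so penalty values are not directly comparable across tools. The data, the penalty grid, and the tooling are all assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, RidgeCV

rng = np.random.default_rng(0)
n, p = 200, 16

# Synthetic predictors, including one deliberately correlated pair.
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + rng.normal(0, 0.3, n)
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize columns
y = 5.5 - 0.25 * X[:, 0] + 0.15 * X[:, 2] + rng.normal(0, 0.5, n)

# Ridge: 10-fold CV over a log-spaced grid of penalties (grid is illustrative).
ridge = RidgeCV(alphas=np.logspace(-3, 3, 50), cv=10).fit(X, y)

# Elastic Net: equal weight on the L1 and L2 penalties, 10-fold CV.
enet = ElasticNetCV(l1_ratio=0.5, cv=10, random_state=0).fit(X, y)

n_ridge = int(np.sum(ridge.coef_ != 0))
n_enet = int(np.sum(enet.coef_ != 0))
print(f"Ridge keeps {n_ridge} of {p} coefficients; Elastic Net keeps {n_enet}")
```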
As this model was our best performing (slightly outperforming LASSO), the full results are displayed below.
Overall, poverty_rate remains the most predictive variable in the Elastic Net model, both overall and in terms of voting left, as a one standard deviation increase in poverty_rate is associated with a -0.25 (leftward) shift in voting behavior. real_labor_productivity_per_employed_2019_base_year is the second-most significant variable, and the most significant variable in terms of predicting right-leaning voting behavior. Net_migration_per_capita is also a top 3 predictor and is correlated with voting left.
In this section, we consider the performance of a range of non-linear methods for predicting political ideology, namely:
To do this, we first split our data into training, validation and test sets based on random draws (with replacement) of probability 0.4, 0.4 and 0.2, respectively. As with the linear models, we avoid stratifying by country given the small number of regions per country available in the dataset. To prevent overfitting, we subsetted the list of variables from the initial 16 to just the top 5 performing predictors from the linear model analysis, namely poverty rate, labor productivity, net migration, gender employment gap and average age.
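A split based on random draws with these probabilities can be sketched by assigning each region a label independently. The row count and column names below are placeholders. Because each label is drawn at random, the realized set sizes only approximate the 40/40/20 proportions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Placeholder dataset: one row per region (the count is illustrative).
df = pd.DataFrame({"region_id": range(200)})

# Draw a split label for each row with probabilities 0.4 / 0.4 / 0.2.
df["split"] = rng.choice(
    ["train", "validation", "test"],
    size=len(df),
    replace=True,
    p=[0.4, 0.4, 0.2],
)

train = df[df["split"] == "train"]
validation = df[df["split"] == "validation"]
test = df[df["split"] == "test"]

print(len(train), len(validation), len(test))
```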
The three models perform as follows:
(Note: the ratio of MSE to $1-R^2$ is higher than in the linear models section due to the higher variance of the response variable in the randomly drawn test dataset for the non-linear models. Despite this, the MSE for random forest is still lower than that of the best performing linear model.)
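The ratio in this note follows from the definition of R-squared, assuming it is computed against the mean of the test-set response:

$$
R^2 = 1 - \frac{\mathrm{MSE}}{\widehat{\mathrm{Var}}(y)}
\quad\Longrightarrow\quad
\frac{\mathrm{MSE}}{1 - R^2} = \widehat{\mathrm{Var}}(y),
\qquad
\widehat{\mathrm{Var}}(y) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2 .
$$

As an illustration, and assuming the random forest's reported R-squared of 0.424 and MSE of 0.591 are computed on the same test set, the implied test-set variance is $0.591 / (1 - 0.424) \approx 1.03$. A test set with a larger spread in the response therefore yields a larger MSE at any given R-squared.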
k-Nearest Neighbors
We first tune k using the validation data to find the k which minimizes the mean-squared error (MSE). The chart below shows the MSE as the parameter k changes. We find that k=6 performs best with the validation data and, more generally, that a k between 5 and 13 leads to similar levels of MSE. We finally test this on our test dataset and compute an MSE of 0.660.
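The tuning procedure can be sketched as a loop over candidate values of k, scoring each on the validation set and evaluating the winner once on the test set. The data is synthetic and the range of k is an illustrative choice.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic data with five predictors and a mildly non-linear response."""
    X = rng.normal(size=(n, 5))
    y = 5.5 + np.tanh(X[:, 0]) - 0.5 * X[:, 1] + rng.normal(0, 0.5, n)
    return X, y

X_train, y_train = make_data(150)
X_val, y_val = make_data(150)
X_test, y_test = make_data(80)

# Fit on the training set and score each k on the validation set.
val_mse = {}
for k in range(1, 31):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    val_mse[k] = mean_squared_error(y_val, knn.predict(X_val))

best_k = min(val_mse, key=val_mse.get)

# Evaluate the selected k once on the held-out test set.
final_knn = KNeighborsRegressor(n_neighbors=best_k).fit(X_train, y_train)
test_mse = mean_squared_error(y_test, final_knn.predict(X_test))

print(f"best k={best_k}, test MSE={test_mse:.3f}")
```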
Random Forest
For random forest, we tune the parameters for the number of features to include with each tree, and the optimal number of trees. The charts below show the outcomes from optimizing the tuning parameters. We find an optimal feature selection of 1. For the number of trees, the optimal parameter is 900, but variation in trees does not greatly impact performance.
Using the optimized parameters, we retrain our model and test the predictions on our test dataset. We find an R-squared of 0.424 and an MSE of 0.591. We further look at the relative importance of each variable in the random forest. Overall, labor productivity is the most important predictor, followed by net migration and poverty rate, as shown in the chart below.
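The tuning, refit, and importance steps can be sketched with scikit-learn's `RandomForestRegressor`, where `max_features` controls how many predictors are considered at each split. The data is synthetic, the grid is far smaller than the one described above to keep the example fast, and refitting on the training set alone is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
names = ["poverty_rate", "labor_productivity", "net_migration",
         "gender_employment_gap", "average_age"]

def make_data(n):
    """Synthetic data; the response rises only for above-average productivity."""
    X = rng.normal(size=(n, 5))
    y = 5.5 - 0.3 * X[:, 0] + 0.4 * np.maximum(X[:, 1], 0) + rng.normal(0, 0.5, n)
    return X, y

X_train, y_train = make_data(150)
X_val, y_val = make_data(150)
X_test, y_test = make_data(80)

# Score every (max_features, n_estimators) pair on the validation set.
results = {}
for max_features in range(1, 6):
    for n_trees in (100, 300):
        rf = RandomForestRegressor(
            n_estimators=n_trees, max_features=max_features, random_state=0
        ).fit(X_train, y_train)
        results[(max_features, n_trees)] = mean_squared_error(y_val, rf.predict(X_val))

best_features, best_trees = min(results, key=results.get)

# Refit with the selected parameters and evaluate once on the test set.
best_rf = RandomForestRegressor(
    n_estimators=best_trees, max_features=best_features, random_state=0
).fit(X_train, y_train)
pred = best_rf.predict(X_test)
test_mse = mean_squared_error(y_test, pred)
test_r2 = r2_score(y_test, pred)

# Impurity-based importances; they sum to one across predictors.
importance = dict(zip(names, best_rf.feature_importances_))
print(best_features, best_trees, round(test_mse, 3), round(test_r2, 3))
```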
Lastly, to understand how these variables impact the response variable, we use partial dependence plots to look at the marginal impact of varying each predictor. Below is a chart showing the partial dependence plots associated with the top predictors. Our results show non-linear relationships, with the random forest model predicting that regions with higher poverty rates and net migration tend to vote more left, while regions with higher productivity and gender employment gaps vote more right. The directions match those of the linear regression analysis above. With regard to non-linearity, the random forest predicts that, holding other factors constant, regions with very high labor productivity are skewed more to the right. Similarly, areas with very high net migration are skewed to the left, but around the mean net migration rate there is not a large difference between left-right voting patterns.
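Partial dependence for one predictor can be computed by hand: fix that predictor at a grid value for every row, predict, and average. The sketch below does this for a random forest fitted on synthetic data. The helper `partial_dependence_1d` is a hypothetical name written for this example, not a library function.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data: column 0 plays the role of poverty rate, column 1 productivity.
X = rng.normal(size=(300, 5))
y = 5.5 - 0.5 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(0, 0.5, 300)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

def partial_dependence_1d(model, data, feature_idx, grid):
    """Average prediction when one feature is set to each grid value in turn."""
    averages = []
    for value in grid:
        modified = data.copy()
        modified[:, feature_idx] = value   # overwrite the feature for all rows
        averages.append(model.predict(modified).mean())
    return np.array(averages)

grid = np.linspace(-2, 2, 9)
pd_productivity = partial_dependence_1d(rf, X, feature_idx=1, grid=grid)
pd_poverty = partial_dependence_1d(rf, X, feature_idx=0, grid=grid)

for value, avg in zip(grid, pd_productivity):
    print(f"productivity={value:+.1f} -> average prediction {avg:.2f}")
```

Plotting each curve against its grid gives the partial dependence plot. A curve that is flat in the middle and steep at the ends is the kind of non-linearity described above.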
Gradient Boosting
The final non-linear method is gradient boosting. As above, we tune the parameters using our validation data, namely the learning rate, the max depth of each tree, and the number of trees. We find that the model is best estimated with a low learning rate (0.01), a higher number of trees (900) and a max-depth of 1, suggesting gradient boosting favors a large number of 'weak learners.' The charts below show the results for this parameter tuning.
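A validation-set search over these three parameters can be sketched with scikit-learn's `GradientBoostingRegressor` and `ParameterGrid`. The grid below is deliberately small and the data is synthetic, so the selected values are not expected to match those reported above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ParameterGrid

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic data with five predictors."""
    X = rng.normal(size=(n, 5))
    y = 5.5 - 0.3 * X[:, 0] + 0.4 * X[:, 1] + rng.normal(0, 0.5, n)
    return X, y

X_train, y_train = make_data(150)
X_val, y_val = make_data(150)

# Illustrative grid over learning rate, tree depth, and number of trees.
grid = ParameterGrid({
    "learning_rate": [0.01, 0.1],
    "max_depth": [1, 3],
    "n_estimators": [100, 300],
})

# Fit on the training set and score each combination on the validation set.
results = []
for params in grid:
    gb = GradientBoostingRegressor(random_state=0, **params).fit(X_train, y_train)
    results.append((mean_squared_error(y_val, gb.predict(X_val)), params))

best_mse, best_params = min(results, key=lambda item: item[0])
print(best_params, round(best_mse, 3))
```

A depth of 1 restricts every tree to a single split, so the fitted model is additive in the predictors. This is one reason a slow learning rate combined with many shallow trees can work well on small datasets.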
Using these parameters, we find an MSE of 0.64, which is worse than our random forest model but better than kNN. As with random forest, productivity is the most important predictor, while average age is the least important.
Given that the random forest produced the best predictions of all models, we look at its distribution of errors. The chart below shows a regression of the actual values on the predicted values and a 45-degree line which would show the 'perfect fit.' Our random forest has a systematic bias: it tends to predict right-leaning regions as closer to the center than they are, and likewise for left-leaning regions. It is, therefore, less good at predicting relative extremes in regional political ideology within Europe.
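This kind of compression toward the center shows up as a slope above 1 when actual values are regressed on predictions. The sketch below uses synthetic predictions that are shrunk toward 5.5 by construction, so the numbers illustrate the diagnostic only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic actual scores and predictions compressed toward the center (5.5).
actual = rng.normal(5.5, 1.0, 200)
predicted = 5.5 + 0.6 * (actual - 5.5) + rng.normal(0, 0.1, 200)

# Least-squares line of actual on predicted; the 45-degree line has slope 1.
slope, intercept = np.polyfit(predicted, actual, deg=1)

# Average error (predicted minus actual) at each end of the scale.
right_bias = (predicted - actual)[actual > 6.5].mean()
left_bias = (predicted - actual)[actual < 4.5].mean()

print(f"slope={slope:.2f}, right-end bias={right_bias:.2f}, left-end bias={left_bias:.2f}")
```

A slope above 1, with negative average error at the right end and positive average error at the left end, corresponds to the pattern described in the text.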
Regional socioeconomic factors are reasonable predictors of left-right voting patterns. The linear models identify the poverty rate as the strongest predictor of voting left, while labor productivity and the gender employment gap are positively associated with voting right. For the non-linear models, Random Forest demonstrates the highest predictive performance with productivity as the strongest predictor, but tends to underpredict extremes. For example, both far-left and far-right regions are predicted as more center. Nonetheless, our analysis shows machine learning models are valuable in capturing nonlinearities.
However, the analysis has several limitations. First, the reliance on cross-sectional data prevents us from observing temporal shifts in political ideology and their drivers. Future research could address this through time-series analyses to identify dynamic changes in variables that influence voting decisions and gain insights into shifting political trends.
Second, the models may lack relevant explanatory variables, which could limit their predictive power. For example, the demographic characteristic of race is not considered in the models even though it may impact voting; race is not reported by Eurostat at the NUTS2 level. Additionally, the absence of cultural and institutional variables in our dataset may omit factors that shape political behavior.
Finally, the study’s generalizability could be affected by the low voter turnout of 50.66% in the 2019 European elections (European Parliament, 2022). Future studies could analyze regional elections with a higher voter turnout to increase reliability.
European Parliament. (2022). Turnout. Available at: https://results.elections.europa.eu/en/turnout/ (Accessed: 30 November 2024).
Pew Research Center. (2024). Partisanship by family income, home ownership, union membership and veteran status. Available at: https://www.pewresearch.org/politics/2024/04/09/partisanship-by-family-income-home-ownership-union-membership-and-veteran-status/ (Accessed: 30 November 2024).
Ansell, B. and Gingrich, J. (2022). Political inequality. Available at: https://ifs.org.uk/publications/political-inequality (Accessed: 30 November 2024).
Kuha, J. (2022). The politics of polling: Why are polls important during elections? Available at: https://www.lse.ac.uk/research/research-for-the-world/impact/the-politics-of-polling-why-are-polls-important-during-elections (Accessed: 30 November 2024).